DECOMPILING WITH DEFINITE CLAUSE GRAMMARS


S.T.Hood


SUMMARY

Decompiling is the process of deriving a computer program in a high-level language from
one in machine-code or assembly language. Defence applications of decompiling include
maintenance of obsolescent equipment, production of scientific and technical intelligence
and assessment of systems for hazards to safety or security. This paper describes an
approach to the rapid generation of decompilers through the use of Definite Clause
Grammars, a class of abstract grammars which can be executed as Prolog programs. The
approach is illustrated using "toy" languages. An environment which permits the
integration of diverse sources of knowledge relevant to the decompilation problem and
provides a graphical interface is described.
1 INTRODUCTION

The  emerging  discipline  of  software  engineering  envisages  computer  software  being  developed 
from  a  statement  of  requirements,  through  several  stages  of  formal  specification,  coding  and 
testing.  There  are,  however,  situations  which  demand  the  assessment  and  possible 
modification  of  the  final  products  of  the  development  process,  for  example  executable 
machine  code,  in  the  absence  of  any  other  descriptions  of  the  system.  The  process  of 
(re)creating higher level, that is, more abstract, descriptions of the system, which may only
have existed in the original designer's mind, is called reverse engineering.

Recent  references  to  the  need  for  reverse  engineering  of  software  include  bringing  large  bodies 
of existing code under the umbrella of computer-aided software engineering (CASE) systems
[Bachman 1988]. For many enterprises, the body of existing code represents a large investment
and may embody corporate knowledge not recorded elsewhere. This has spawned a
significant industry providing tools and contracted expertise supporting activities such as the
transformation of "spaghetti code" into well-structured programs and the translation of
programs in old languages into ones which are supported on modern systems [Kotik and
Markosian 1989]. While some of this work has been concerned with assembly languages, for
example  on  IBM  mainframes,  most  has  concerned  higher-level  languages  such  as  FORTRAN. 

A  problem  closer  to  the  subject  of  this  paper,  namely  the  recovery  of  a  higher-level  language 
program from executable machine code, was that tackled by several groups in the United
States in determining the behaviour and structure of the notorious "Internet Worm" program
[Spafford 1988]. This work was done without the aid of any automated tools, apart from the
use  of  the  UNIX  C  compiler  for  checking  hypotheses  (Eugene  Spafford,  private 
communication). 

Defence,  with  the  longevity  of  its  equipment,  non-standard  embedded  processors  and 
requirements  for  rapid  modifications  in  response  to  new  threats,  countermeasures  or  operating 
environments,  has  a  particular  interest  in  reverse  engineering  tools  and  techniques.  Reverse 
engineering  of  software  is  also  relevant  to  the  production  of  scientific  and  technical 
intelligence and to the assessment of otherwise "black box" systems for hazards to safety or
security. The requirements for quick reaction and secrecy raised by many of these applications
argue for powerful tools which can be rapidly customised to suit the problem at hand and
which permit the job to be completed by a small number of analysts.

We  also  note,  without  comment,  the  increasingly  common  practice  of  proscribing  reverse 
engineering  of  licensed  software,  exemplified  in  the  following  quotation  from  the  "License 
and  limited  warranty  agreement"  printed  in  the  reference  manual  for  a  personal  computer 
electronic mail system: "You may not ... reverse engineer, disassemble, decompile, or make
any attempt to discover the source code to the software" [CE Software 198?].

Decompiling  is  the  process  of  transforming  a  program  expressed  in  assembly  language  or 
machine  code  into  a  description  in  a  high-level  language  such  as  Ada  or  C  which,  with  the 
aid  of  a  suitable  compiler,  can  be  transformed  into  the  original  code.  This  paper  describes 
some  experiments  in  using  the  language  Prolog  for  the  rapid  construction  of  decompilers. 
Familiarity  with  Prolog  notation  is  assumed  and  the  reader  is  referred  to  a  textbook  such  as 
Sterling and Shapiro [1986]. Sufficient Prolog code is included to permit the reader to
experiment  with  and  extend  the  examples  discussed. 
2 LANGUAGE PROCESSING WITH PROLOG

Prolog was originally developed as a tool for implementing natural-language understanding
systems [Colmerauer 1975]. The reader is referred to the paper by Pereira and Warren [1980]
or any Prolog textbook [Sterling and Shapiro 1986] for an introduction to the topic. The
construction of useful natural-language processing systems is still largely a research
activity, with Prolog being the tool of choice for some major current projects [Alshawi, Moore,
Moran and Pulman 1988]. Programming languages are considerably simpler than
natural languages, so that the construction of compilers in Prolog is quite straightforward
[Warren 1980, Cohen and Hickey 1987].

Most  programming  languages  produced  after  Algol-60  have  their  syntax  defined  by  a  formal 
context-free grammar, normally expressed in a notation called Backus-Naur Form (BNF) [Aho
and Ullman 1977]. The power of Prolog for language processing is conveyed by the fact that it
allows a notational variant of BNF, called "grammar rules", to be simply transformed into a
Prolog program which, when executed, accepts syntactically correct programs (or, in general,
"sentences"). Prolog grammar rules actually define an extension of context-free grammars
called definite-clause grammars (DCGs), with the descriptive power of a general purpose
computer. In practice, most commercial compilers of Prolog are themselves written in Prolog.

Grammar rules have a single Prolog term, representing a non-terminal symbol of the grammar,
on the left-hand side of the arrow, while the right-hand side contains terms representing
other non-terminals, Prolog lists representing sequences of terminal symbols and arbitrary
Prolog code enclosed in braces used to apply constraints or implement side-effects. Grammar
rules are normally translated into executable Prolog by augmenting non-terminal symbols
with additional arguments representing the input list of symbols and the list following the
recognised  symbol,  while  terminal  symbols  are  translated  into  a  format  wherein  they  appear 
at  the  head  of  the  input  list  [Pereira  and  Warren  1980].  Prolog  code  enclosed  in  braces  is 
unchanged  in  translation.  Most  Prolog  interpreters  or  compilers  recognise  and  translate 
grammar-rules  interspersed  with  clauses  in  standard  notation.  Others  may  provide  a  library 
procedure  for  the  purpose.  For  example,  the  rule: 

nonterminal_1(Attribute) -->
    nonterminal_2(Attribute), [terminal], {constraint(Attribute)}.

which may be read as:

"nonterminal_1(Attribute) can replace nonterminal_2(Attribute) followed by the
symbol 'terminal', provided that constraint(Attribute) is satisfied",

might be translated into:

nonterminal_1(Attribute, X, Y) :-
    nonterminal_2(Attribute, X, Z),
    'C'(Z, terminal, Y),
    constraint(Attribute).

where the system predicate 'C' (for "connects") is defined by:

'C'([X|S], X, S).

DCGs are executed top-down much like a recursive-descent parser for a context-free grammar.
The above rule would be invoked with variable X bound to a list of symbols. If the rule is
successfully applied, variable Y would become bound to the list of symbols following those
subsumed by nonterminal_1(Attribute).
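
The translation just described can be tried directly. The following is a minimal sketch, assuming a Prolog system which translates grammar rules on loading and provides phrase/3; the definitions given for nonterminal_2 and constraint, and the name nonterminal_1_expanded, are illustrative assumptions and do not come from any of the grammars discussed in this paper.

```prolog
% Grammar rule in the notation described in the text.
% nonterminal_2//1 and constraint/1 are given illustrative definitions.
nonterminal_1(Attribute) -->
    nonterminal_2(Attribute), [terminal], { constraint(Attribute) }.

nonterminal_2(N) --> [N], { integer(N) }.

constraint(N) :- N > 0.

% Hand-written equivalent with the two list arguments made explicit.
% Z = [terminal|Y] plays the role of 'C'(Z, terminal, Y).
nonterminal_1_expanded(Attribute, X, Y) :-
    nonterminal_2(Attribute, X, Z),
    Z = [terminal|Y],
    constraint(Attribute).
```

The query ?- phrase(nonterminal_1(A), [3, terminal, rest], Rest). binds A to 3 and Rest to [rest]; replacing 3 by -1 causes the rule to fail because the constraint in braces is not satisfied. The hand-expanded clause behaves in the same way when called with the input list and the remainder as its last two arguments.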
2.1 A Toy Compiler

Sterling and Shapiro [1986] provide the complete Prolog code implementing a multi-pass
compiler for a toy language, PL, with a syntax similar to a subset of Pascal [Jensen
and Wirth 1975], into a fictitious machine instruction set. Their example is based on one
due to Warren [1980]. Their machine instruction set provides a single accumulator with
both  immediate  and  direct  memory  addressing  and  conditional  branch  instructions.  The 
compiler  of  Sterling  and  Shapiro  has  been  extended  to  accept  source  code  from  a  file  and 
to  generate  assembly  code  formatted  for  the  convenience  of  the  decompiler,  as 
explained  below.  The  compiler  comprises  a  tokeniser,  code-generator  and  assembler,  in 
addition  to  the  DCG  parser  shown  in  Figure  1.  Figure  2  shows  an  example  of  the  input 
and  Figure  3  the  invocation  of  the  compiler  with  the  resulting  absolute  machine  code 
and  symbol  table. 

2.2  A  Toy  Decompiler 

Figure  4  shows  a  DCG  parser  which  accepts  the  output  of  the  above  compiler,  building 
in the process a description of the software as a Prolog term containing Pascal-like
control structures. The non-terminals associated with arithmetic operations
(arith_exp and arith_op) take as their third arguments terms of the form:

X  ->  Y. 

where  X  and  Y  unify  with  the  content  of  the  accumulator  before  and  after  the  relevant 
operation  respectively.  This  notation  is  readily  extended  to  a  more  complex  processor 
by  providing  arguments  to  represent  additional  registers  or  flags.  The  information 
needed  to  construct  the  decompiler  was  obtained  through  inspection  of  the  compiler. 

The  parsing  of  standard  control  structures  often  requires  the  recognition  of  a  jump  to  the 
next  instruction  following  the  parsed  sequence.  While  DCGs  are  quite  capable  of 
performing  calculations  on  addresses,  it  is  more  elegant  to  incorporate  in  each  assembler 
instruction  its  address  and  the  address  of  the  location  following  the  instruction.  This 
allows  control  structures  to  be  recognized  using  only  unification  (so  that  symbolic  labels 
can  be  used  if  desired)  and  permits  instructions  of  varying  lengths.  The  program 
requires  its  input  in  the  form  of  a  Prolog  list.  Producing  the  required  format  from 
standard assembly listings would be a trivial task using Prolog or standard UNIX
data-manipulation tools [Bourne 1983].
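
The role of the two address fields can be sketched in isolation. The fragment below is an illustrative assumption rather than part of the decompiler of Figure 4: loop, body and simple are hypothetical non-terminals, forever is a hypothetical structure name, and the labels l1 to l4 stand for arbitrary unique symbols. A backward jump is recognised purely because its operand unifies with the address of the first instruction of the body.

```prolog
% Each instruction is a list [Address, NextAddress, Opcode, Operand].
% loop//3 recognises a body followed by a jump back to the start of the body.
loop(Start, Exit, forever(Body)) -->
    body(Start, P1, Body),
    [[P1, Exit, jump, Start]].

% A body is one or more non-jump instructions whose addresses chain together.
body(P, Q, [Op-Arg]) --> simple(P, Q, Op, Arg).
body(P, R, [Op-Arg|Rest]) --> simple(P, Q, Op, Arg), body(Q, R, Rest).

simple(P, Q, Op, Arg) --> [[P, Q, Op, Arg]], { Op \== jump }.
```

Given the list [[l1,l2,load,x], [l2,l3,store,y], [l3,l4,jump,l1]], the query ?- phrase(loop(S, E, B), List). binds S to l1, E to l4 and B to [load-x, store-y]. No arithmetic on addresses is performed, so symbolic labels and numeric addresses are handled alike.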

Figure  5  shows  the  structure  built  by  this  "decompiler"  when  given  the  code  from  Figure 
3 and the results of formatting this according to the syntax of PL using a simple
pretty-printer. This output can be successfully recompiled. The decompiler handles all of the
constructs  allowed  in  the  PL  language. 
/*

parse(Tokens, Structure) :-

Structure represents the successfully parsed list
of Tokens.

*/

parse(Source, Structure) :-
    pl_program(Structure, Source, []).

pl_program(S) --> [program], identifier(X), [';'], statement(S).

statement((S;Ss)) -->
    [begin], statement(S), rest_statements(Ss).
statement(assign(X,V)) -->
    identifier(X), [':='], expression(V).
statement(if(T,S1,S2)) -->
    [if], test(T), [then], statement(S1), [else], statement(S2).
statement(while(T,S)) -->
    [while], test(T), [do], statement(S).
statement(read(X)) -->
    [read], identifier(X).
statement(write(X)) -->
    [write], expression(X).

rest_statements((S;Ss)) --> [';'], statement(S), rest_statements(Ss).
rest_statements(void) --> [end].

expression(X) --> pl_constant(X).
expression(expr(Op,X,Y)) -->
    pl_constant(X), arithmetic_op(Op), expression(Y).

arithmetic_op('+') --> ['+'].
arithmetic_op('-') --> ['-'].
arithmetic_op('*') --> ['*'].
arithmetic_op('/') --> ['/'].

pl_constant(name(X)) --> identifier(X).
pl_constant(number(X)) --> pl_integer(X).

identifier(X) --> [X], {atom(X)}.
pl_integer(X) --> [X], {integer(X)}.

test(compare(Op,X,Y)) -->
    expression(X), comparison_op(Op), expression(Y).

comparison_op('=') --> ['='].
comparison_op('>') --> ['>'].
comparison_op('<') --> ['<'].
comparison_op('>=') --> ['>='].
comparison_op('=<') --> ['=<'].

Figure 1. A DCG parser for the PL language (Sterling and Shapiro 1986).
program factorial;
begin
    read value;
    count := 1;
    result := 1;
    while count < value do
        begin
            count := count + 1;
            result := result * count
        end;
    write result
end.


Figure 2. A PL program


?- cfile(factorial_pl).

[(value,18),(count,19),(result,20)]

[[0,1,read,18],
 [1,2,loadc,1],
 [2,3,store,19],
 [3,4,loadc,1],
 [4,5,store,20],
 [5,6,load,19],
 [6,7,sub,18],
 [7,8,jumpge,15],
 [8,9,load,19],
 [9,10,addc,1],
 [10,11,store,19],
 [11,12,load,20],
 [12,13,mul,19],
 [13,14,store,20],
 [14,15,jump,5],
 [15,16,load,20],
 [16,17,write,0],
 [17,18,halt,0]
]
yes


Figure 3 Script of a Prolog session (user input in italics) showing the invocation of the
PL compiler with the file factorial_pl containing the code shown in Figure 2. The output
comprises the symbol table followed by the assembled object code.
?- op(..., xfx, :=), op(..., yfx, ->), op(..., fy, ...), op(130, fx, var).

decomp(Code, PL) :- pl_prog(_, _, PL, Code, []).

pl_prog(P, Q, prog(X)) --> pl_frag(P, P1, X), [[P1, Q, halt, 0]].
pl_frag(P, Q, X) --> stmt(P, Q, X).
pl_frag(P, R, (X;Y)) --> stmt(P, Q, X), pl_frag(Q, R, Y).

stmt(P, Q, read(var(A))) --> [[P, Q, read, A]].
stmt(P, Q, write(var(A))) -->
    [[P, P1, load, A], [P1, Q, write, 0]].
stmt(P, Q, noop) --> [[P, Q, noop, _]].
stmt(P, Q, while(Test, Do)) -->
    branch_if_not(_, _, Test, Q),
    pl_frag(_, _, Do),
    [[_, Q, jump, P]].
stmt(P, Q, if(Test, Then, Else)) -->
    branch_if_not(_, _, Test, P1),
    pl_frag(_, _, Then),
    [[_, P1, jump, Q]],
    pl_frag(P1, Q, Else).
stmt(P, Q, if(Test, Then)) -->
    branch_if_not(_, _, Test, Q),
    pl_frag(_, Q, Then).
stmt(P, Q, Asgns) -->
    arith_exp(P, P1, _ -> X),
    assign_seq(P1, Q, Asgns, X).

assign_seq(P, Q, var(A) := B, X) -->
    [[P, P1, store, A]], assign_seq(P1, Q, B, X).
assign_seq(P, Q, var(A) := X, X) --> [[P, Q, store, A]].

branch_if_not(P, Q, Test, R) -->
    arith_exp(P, _, _ -> A), arith_op(_, _, A -> A-B),
    { comparison_opcode(Comp, JumpOp), func_args(Test, Comp, A, B) },
    [[_, Q, JumpOp, R]].

arith_exp(P, Q, A) --> arith_op(P, Q, A).
arith_exp(P, Q, A -> C) -->
    arith_op(P, P1, A -> B), arith_exp(P1, Q, B -> C).

arith_op(P, Q, _ -> A) --> [[P, Q, loadc, A]].
arith_op(P, Q, _ -> var(A)) --> [[P, Q, load, A]].
arith_op(P, Q, A -> E) -->
    [[P, Q, Op, B]],
    { literal_operation(Sym, Op), func_args(E, Sym, A, B) }.
arith_op(P, Q, A -> E) -->
    [[P, Q, Op, B]],
    { memory_operation(Sym, Op), func_args(E, Sym, A, var(B)) }.

comparison_opcode('=', jumpne).   comparison_opcode('>', jumple).
comparison_opcode('>=', jumplt).  comparison_opcode('<', jumpge).
comparison_opcode('=<', jumpgt).

literal_operation('+', addc).  literal_operation('-', subc).
literal_operation('*', mulc).  literal_operation('/', divc).

memory_operation('+', add).  memory_operation('-', sub).
memory_operation('*', mul).  memory_operation('/', div).

% access components of Term = Functor(Arg1, Arg2) (more efficient than =..)
func_args(Term, Functor, Arg1, Arg2) :-
    functor(Term, Functor, 2),
    arg(1, Term, Arg1),
    arg(2, Term, Arg2).


Figure 4 A DCG parser for decompiling the output of the PL compiler
prog((read(var 18);
      (var 19 := 1;
       (var 20 := 1;
        (while(var 19 < var 18,
               (var 19 := var 19 + 1;
                var 20 := var 20 * var 19));
         write(var 20))))))

(a) The structure created during the decompilation of the code of Figure 3
using the grammar of Figure 4.

program thing;
begin
    read var18;
    var19 := 1;
    var20 := 1;
    while var19<var18 do begin
        var19 := var19+1;
        var20 := var20*var19
    end ;
    write var20
end .

(b) The above term pretty-printed in a format acceptable to the PL
compiler.


Figure 5


2.3  Compiler  Optimisations 

One  of  the  problems  real  decompilers  will  face  is  the  handling  of  compiler 
optimisations.  This  is  illustrated  by  a  simple  case  which  is  handled  by  our  toy 
decompiler.  The  compiler  described  above  translates  the  PL  sequence: 


a := b;
c := b


into the assembler sequence:


load b
store a
load b
store c

An optimising compiler would recognise that the second load is redundant and would
remove it. The rule for recognising assignment statements in the decompiler accepts the
"optimised" code:


load b
store a
store c

and translates it into the form:


c := a := b

which could, if desired, be transformed into two separate assignments to match PL
syntax rules.


A full discussion of the decompilation of optimised code is beyond the scope of this
paper.
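
The transformation of a chained assignment into separate assignments, mentioned above, is straightforward. The following is a minimal sketch only: split_assignments is a hypothetical predicate, and the priority and type given to := here (700, xfy) are assumptions chosen so that the chain can be written without parentheses; they are not necessarily those used in the decompiler.

```prolog
:- op(700, xfy, :=).   % assumed priority and type, for this sketch only

% split_assignments(+Chain, -Statements)
% Chain is T1 := T2 := ... := Value; Statements assigns Value to each target.
split_assignments(Target := Rest, [Target := Value|More]) :-
    nonvar(Rest), Rest = (_ := _), !,
    split_assignments(Rest, More),
    More = [_ := Value|_].          % every target receives the same value
split_assignments(Target := Value, [Target := Value]).
```

For example, ?- split_assignments(c := a := b, L). yields L = [c := b, a := b], which a pretty-printer could then emit as two PL statements. Since both targets receive the same value, the order of the resulting statements is immaterial.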


3. DECOMPILING "SMALL-C" FOR THE INTEL 8085

The instruction set used in the above example is, unfortunately, rather different from those of
typical microprocessors. In order to provide a more realistic evaluation, the Intel 8085 8-bit
microprocessor [Intel 1977] was chosen. A convenient high-level language was provided by
the public domain Small-C compiler, which accepts a large subset of the C language
[Kernighan and Ritchie 1978]. We describe below the construction of decompilers for two
subsets of Small-C including while and if-then-else control structures and assignment
statements with arithmetic expressions. The first employs only static integer variables
(variables are assigned addresses in memory), the second only automatic variables (variables
are assigned on the system stack so that storage for them is created and destroyed on
procedure entry and exit). A full decompiler for the language would include rules for both
classes of variables as well as character and pointer data-types, more complex expressions
and the remaining control structures.

3.1  Static  Variables 

Figure 6 shows a C version of the factorial program (without the read and write
commands for simplicity) and a fragment of the 8085 assembly language generated by
the Small-C compiler. One noticeable difference from the toy instruction set is that the
store step in an assignment statement is now delocalised, with the address calculated
and pushed onto the stack prior to evaluation of the expression, whence it is
subsequently popped for use.

Figure 8 shows the input and output. The assembly language was formatted manually
using a text editor. Arbitrary numerical addresses, which are used only for unification
and so could be any unique symbols, have been used in the address fields, while labels
for variable locations and run-time routines have been left in their original form by
declaring ? as a prefix operator. It is, perhaps, interesting to note that Pascal-like
code has been produced by decompiling the output of a C compiler (although the subset
of C we have chosen is easily mapped into Pascal).
main()
{
    static int value, count, result;
    value = 10;
    count = 1;
    result = 1;
    while (count < value) {
        count = count + 1;
        result = result * count;
    }
}

(a) A Small-C program using static variables

; Small C 8080;
; Coder (2.4,84/11/20)
; Front End (2.7,84/11/28)
        extern  ?pint
        cseg
main:
; Allocate storage for variables in data segment.
        dseg
?2:     ds
        cseg
        dseg
?3:     ds
        cseg
        dseg
?4:     ds
        cseg
; Load HL register-pair with address of variable
; and save it on the stack.
        lxi     h,?2
        push    h
; Load HL register-pair with value 10.
        lxi     h,10
; Load DE register-pair with address of variable
; and call run-time routine to store an integer.
        pop     d
        call    ?pint

(b) 8085 assembly language produced by the Small-C compiler from the program of (a),
up to the end of the first assignment statement (value = 10). Added comments are
shown in italics.


Figure 6
Figure 7 shows the grammar rules of the decompiler. In this implementation of C, the
HL register assumes the role of accumulator in arithmetic operations.


decomp(Code, PL) :- pl_prog(_, _, PL, Code, []).

pl_prog(P, Q, prog(X)) --> pl_frag(P, P1, X), [[P1, Q, ret]].
pl_frag(P, Q, X) --> stmt(P, Q, X).
pl_frag(P, Q, (X;Y)) --> stmt(P, R, X), pl_frag(R, Q, Y).

stmt(P, Q, while(Test, Do)) -->
    branch_if_not(P, P1, Test, Q),
    pl_frag(P1, P2, Do),
    [[P2, Q, jmp, P]].
stmt(P, Q, if(Test, Then, Else)) -->
    branch_if_not(P, P1, Test, P3),
    pl_frag(P1, P2, Then),
    [[P2, P3, jmp, Q]],
    pl_frag(P3, Q, Else).
stmt(P, Q, if(Test, Then)) -->
    branch_if_not(P, P1, Test, Q),
    pl_frag(P1, Q, Then).
stmt(P, Q, var(A) := X) -->
    [[P, P1, lxi, h, A],
     [P1, P2, push, h]],
    arith_exp(P2, P3, _ -> X),
    [[P3, P4, pop, d],
     [P4, Q, call, ?pint]].

branch_if_not(P, Q, Test, R) -->
    arith_exp(P, P1, _ -> A),
    [[P1, P2, push, h]],
    arith_exp(P2, P3, _ -> B),
    [[P3, P4, pop, d], [P4, P5, call, Subr]],
    { comp_op(Comp, Subr), func_args(Test, Comp, A, B) },
    [[P5, P6, mov, a, h], [P6, P7, ora, l], [P7, Q, jz, R]].

comp_op('=', ?eq).   comp_op('<', ?lt).
comp_op('>', ?gt).   comp_op('=<', ?le).
comp_op('>=', ?ge).  comp_op('!=', ?ne).

arith_exp(P, Q, A) --> arith_op(P, Q, A).
arith_exp(P, Q, A -> C) -->
    arith_op(P, P1, A -> B), arith_exp(P1, Q, B -> C).

arith_op(P, Q, _ -> A) --> [[P, Q, lxi, h, A]].
arith_op(P, Q, _ -> var(A)) -->
    [[P, _, lxi, h, A], [_, Q, call, ?gint]].
arith_op(P, Q, A -> E) -->
    [[P, _, push, h]],
    arith_exp(_, _, _ -> B),
    [[_, _, pop, d]],
    do_arith_op(_, Q, Op),
    { func_args(E, Op, A, B) }.

do_arith_op(P, Q, '+') --> [[P, Q, dad, d]].
do_arith_op(P, Q, '-') --> [[P, Q, call, ?sub]].
do_arith_op(P, Q, '*') --> [[P, Q, call, ?mul]].
do_arith_op(P, Q, '/') --> [[P, Q, call, ?div]].

Figure 7. A decompiler which accepts 8085 assembly language from a subset of the
Small-C language using only static variables.
3.2 Automatic Variables

Automatic variables are accessed by calculating an offset from the current value of the stack
pointer (a register known as SP in the case of the 8085). The decompiler grammar is
complicated by the need to account for changes in SP as the stack is used for temporary storage
during expression evaluation, as shown in the annotated assembly code in Figure 9. It is
apparent that the number of pushes during a C statement is balanced by the number of pops, so
it is sufficient to include an extra variable in the non-terminals used in arithmetic expressions
(Figure 10) to record SP decrements (by 2 with each push) to decide which variable is being
accessed.
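
The bookkeeping described above can be sketched as follows. This is an illustrative assumption, not the grammar of Figure 10: instructions are abbreviated to Opcode-Operand pairs with the address fields omitted, and accesses is a hypothetical non-terminal whose first argument records the number of bytes currently pushed as temporaries.

```prolog
% accesses(S, Slots): S bytes of temporaries are on the stack; Slots lists the
% frame slots addressed by each "lxi h,Offset ; dad sp" pair in the input.
accesses(_, []) --> [].
accesses(S, Slots) -->
    [push-_], { S1 is S + 2 },            % each push lowers SP by 2
    accesses(S1, Slots).
accesses(S, Slots) -->
    [pop-_], { S >= 2, S1 is S - 2 },     % each pop restores it
    accesses(S1, Slots).
accesses(S, [var(Slot)|Slots]) -->
    [lxi-Offset, dad-sp],
    { Slot is Offset - S },               % correct the offset for temporaries
    accesses(S, Slots).
```

With the input [lxi-4, dad-sp, push-h, lxi-6, dad-sp, pop-d], the query ?- phrase(accesses(0, Slots), Input). gives Slots = [var(4), var(4)]: the offset 6 used after one push refers to the same variable as the offset 4 used before it.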


; Small C 8080;
; Coder (2.4,84/11/27)
; Front End (2.7,84/11/28)
        extern  ?pint
        cseg
main:
; Establish a stack frame for 3 integer variables.
        push    b
        push    b
        push    b
; Calculate variable address as offset from current
; stack-pointer (SP) and save it on the stack.
        lxi     h,4
        dad     sp
        push    h
; Load HL register-pair with value 10.
        lxi     h,10
; Load DE register-pair with address of variable
; and call run-time routine to store an integer.
        pop     d
        call    ?pint


Figure 9 8085 assembly language produced by the Small-C compiler from the program
of Figure 6a (but with variables declared int, rather than static int), up to the end of the
first assignment statement (value = 10). Added comments are shown in italics.


Formatting  the  code  of  Figure  9  as  a  Prolog  list  in  the  manner  of  Figure  8a,  and 
employing  the  grammar  of  Figure  10  in  the  decompiler,  we  obtain  a  listing  identical 
(apart  from  the  names  of  variables)  to  that  of  Figure  8b. 
decomp(Code, PL) :- pl_prog(_, _, PL, Code, []).

pl_prog(P, Q, prog(X)) -->
    pushes(P, P1, 0, N),
    pl_frag(P1, P2, X),
    pops(P2, P3, 0, N),
    [[P3, Q, ret]],
    { write(num_of_vars = N), nl }.

pl_frag(P, Q, X) --> stmt(P, Q, X).
pl_frag(P, Q, (X;Y)) -->
    stmt(P, R, X),
    pl_frag(R, Q, Y).

stmt(P, Q, while(Test, Do)) -->
    branch_if_not(P, P1, Test, Q),
    pl_frag(P1, P2, Do),
    [[P2, Q, jmp, P]].
stmt(P, Q, if(Test, Then, Else)) -->
    branch_if_not(P, P1, Test, P3),
    pl_frag(P1, P2, Then),
    [[P2, P3, jmp, Q]],
    pl_frag(P3, Q, Else).
stmt(P, Q, if(Test, Then)) -->
    branch_if_not(P, P1, Test, Q),
    pl_frag(P1, Q, Then).
stmt(P, Q, var(A) := X) -->
    [[P, P1, lxi, h, A],
     [P1, P2, dad, sp],
     [P2, P3, push, h]],
    arith_exp(P3, P4, _ -> X, 2),
    [[P4, P5, pop, d],
     [P5, Q, call, ?pint]].

branch_if_not(P, Q, Test, R) -->
    arith_exp(P, P1, _ -> A, 0),
    [[P1, P2, push, h]],
    arith_exp(P2, P3, _ -> B, 2),
    [[P3, P4, pop, d],
     [P4, P5, call, Subr]],
    {
      comp_op(Comp, Subr),
      func_args(Test, Comp, A, B)
    },
    [[P5, P6, mov, a, h],
     [P6, P7, ora, l],
     [P7, Q, jz, R]].

comp_op('=', ?eq).   comp_op('<', ?lt).
comp_op('>', ?gt).   comp_op('=<', ?le).
comp_op('>=', ?ge).  comp_op('!=', ?ne).

arith_exp(P, Q, A, S) -->
    arith_op(P, Q, A, S).
arith_exp(P, Q, A -> C, S) -->
    arith_op(P, P1, A -> B, S),
    arith_exp(P1, Q, B -> C, S).

arith_op(P, Q, _ -> A, _) -->
    [[P, Q, lxi, h, A]].
arith_op(P, Q, _ -> var(A), S) -->
    [[P, _, lxi, h, X],
     [_, _, dad, sp],
     [_, Q, call, ?gint]],
    { plus(A, S, X) }.
arith_op(P, Q, A -> E, S) -->
    [[P, _, push, h]],
    { plus(S, 2, S1) },
    arith_exp(_, _, _ -> B, S1),
    [[_, _, pop, d]],
    do_arith_op(_, Q, Op),
    { func_args(E, Op, A, B) }.

do_arith_op(P, Q, '+') -->
    [[P, Q, dad, d]].
do_arith_op(P, Q, '-') -->
    [[P, Q, call, ?sub]].
do_arith_op(P, Q, '*') -->
    [[P, Q, call, ?mul]].
do_arith_op(P, Q, '/') -->
    [[P, Q, call, ?div]].

pushes(P, Q, N0, N) -->
    [[P, P1, push, b]],
    pushes(P1, Q, N0, N1),
    { plus(N1, 1, N) }.
pushes(P, Q, N, N) --> [].

pops(P, Q, N0, N) -->
    [[P, P1, pop, b]],
    pops(P1, Q, N0, N1),
    { plus(N1, 1, N) }.
pops(P, Q, N, N) --> [].


Figure 10 A "decompiler" for 8080 assembly language produced by the Small-C
compiler using only automatic variables. Decompilation results in a generic block-
structured representation which can be formatted to produce a language of choice (e.g. Pascal).
4. INTERACTIVE DECOMPILING


4.1  Using  the  Prolog  Database 

The standard approach to translating DCG rules into executable Prolog, which has been
used in the examples thus far, requires the input assembler code to be represented as a
Prolog list. While this normally provides faster execution than other approaches
[Pereira and Warren 1980], it has some disadvantages.

We  have  not  considered  here  many  of  the  processes  which  might  be  required  prior  to 
decompilation, such as the separation of code from data [Horspool and Marovac 1980],
or  the  determination  of  the  semantics  of  run-time  procedures  not  part  of  the  available 
code.  It  is  suggested  that  the  reverse-engineering  of  a  large  program  will  require 
significant  interaction  with  a  human  analyst  over  a  considerable  period  and  that  more 
rapid  progress  will  be  made  on  some  sections  of  the  code  than  on  others. 

The reasoning applied by a human analyst might be sometimes bottom-up, or
data-driven:

"that small procedure has two 8-bit XOR instructions; it probably is doing a 16-bit
XOR",

and at other times top-down or hypothesis-driven:

"let's assume this was written in PL/M-80".

The model of the reverse-engineering process which we have in mind appears to be a
close fit to the so-called Blackboard Model which had its origins in computer
understanding of speech [Nii 1986a, 1986b].

Pereira and Warren [1980] note that lists are not the only way of representing sequences
of symbols in implementing DCG parsers. In particular, if individual instructions are
stored as facts in the Prolog database, any rule of a decompiler grammar can be applied
starting at any instruction, allowing a combination of top-down and bottom-up parsing.
Further, as pointed out by Pereira and Warren, when a rule has been successfully
applied we can add a fact containing the recognised structure to the database so that it
can be considered by other rules without repeating the computation required in its
recognition. Parsers which employ such tables (or charts) of already recognised
well-formed substrings are often called chart-parsers [Winograd 1983].

Figure 11 is an interpreter for a DCG grammar which expects input symbols to appear as
facts in a ternary relation e (standing for "edge": chart-parsing jargon), referred to
here as the chart. In order to avoid confusion, a different arrow symbol has been used in
the grammar-rules. Note that this interpreter inspects the chart for a required
structure before looking for a rule. Recognised PL statements are added to the database,
but smaller fragments are not. This appears to be a reasonable approach, given the
relative expense of the assert operation. These assertions of deduced results into the
database are within the spirit of logic programming as they do not change the meaning
of the program.
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3ub3tr(P,  Q,  (A,  A3)1 

substctP,  PI,  A), 

3ub3tr(Pl,  Q,  As), 
substr  (P,  P,  (] )  . 
substt(P,  P,  (X!)  X,  ! . 
sub3tr(P,  Q,  e(P,  Q,  T) )  sub3ti(P,  Q,  T) 
substriP,  Q,  T)  «(P,  Q,  T) ,  !. 
substr(P,  0,  T) 

(T  — >  A) , 

3ub3Cr(P,  Q,  A), 

(T  »  3cmt(S)  ->  new_adge(e<P,  Q,  T) )  ;  true). 

new_edge(E)  :■*  E,  !. 
new_edge(E) 

asserts (E) , 
write (E) , 
nl 

Figure 11 Chart-parser implemented as an interpreter of DCGs.


The DCGs employed in the previous examples required each non-terminal to contain
parameters representing the current and next "address". As this information is now
included in the chart, it need only appear in rules when it is necessary for recognising
control structures. In such cases, direct reference is made to the chart representation.
Generally rules are less cluttered than in the previous notation.
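
As a minimal sketch of how rules in the chart-based notation might read, the fragment below pairs a cut-down copy of the interpreter with two hypothetical rules. The rules address_of and stmt are illustrative assumptions and are not the rules of Figure 12. The arrow is written ---> (an assumed spelling), the run-time routine is written as the plain atom pint, and the chart holds only the six instructions numbered 4 to 10.

```prolog
% Sketch only: the two grammar rules are hypothetical.
:- dynamic e/3.
:- op(1200, xfx, --->).     % assumed spelling of the grammar arrow
:- op(700, xfx, :=).

% Chart: six instructions linked by their "addresses".
e(4, 5, [lxi, h, 4]).
e(5, 6, [dad, sp]).
e(6, 7, [push, h]).
e(7, 8, [lxi, h, 10]).
e(8, 9, [pop, d]).
e(9, 10, [call, pint]).

% Hypothetical rules. Neither needs to mention an address.
address_of(N) ---> [lxi, h, N], [dad, sp].
stmt(var(N) := K) --->
    address_of(N), [push, h], [lxi, h, K], [pop, d], [call, pint].

% Interpreter: consult the chart first, then look for a rule.
substr(P, Q, (A, As)) :- substr(P, P1, A), substr(P1, Q, As).
substr(P, P, []).
substr(P, P, {X}) :- X, !.
substr(P, Q, e(P, Q, T)) :- substr(P, Q, T).
substr(P, Q, T) :- e(P, Q, T), !.
substr(P, Q, T) :-
    (T ---> A),
    substr(P, Q, A),
    ( T = stmt(_) -> new_edge(e(P, Q, T)) ; true ).

% Save a recognised statement unless it is already in the chart.
new_edge(E) :- E, !.
new_edge(E) :- assertz(E), write(E), nl.
```

With these clauses loaded, the query ?- substr(4, Q, stmt(S)). binds Q to 10 and S to var(4) := 10, and the recognised statement is added to the chart as a new edge.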

Figure 12 shows the grammar of Figure 10 in the new format. Figure 13 shows the
assembly code in the format for use with the chart parser. Note that, as with the
list-based representation, the arguments (I, J) which link successive facts
e(I, J, instruction) are used only for unification so they can be any unique terms. Figure 14
shows a script of a Prolog session in an environment containing the clauses of Figures 11,
12, 4b and 13. Running this example (with the printing of messages disabled) on an
Apple Macintosh-Plus under Advanced AI Systems Prolog requires 4.08 seconds of CPU
time. If no recognised structures are saved in the chart, the time increases to 7.35
seconds. For comparison, the time taken for the equivalent problem using the list-based
representation (Figure 10) is 2.1 seconds. Execution speed using the chart representation
could be substantially improved by "compiling" grammar rules into equivalent Prolog
clauses, as is done for the list-based representation. For our present purposes, the use of
an interpreter facilitates experimentation.
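
The "compiled" form mentioned above can be sketched by hand. Assuming a hypothetical assignment rule (not taken from Figure 12), each non-terminal becomes a Prolog predicate with two extra address arguments, and each terminal becomes a direct look-up in the chart relation e/3. The names stmt/3 and address_of/3, the plain atom pint for the run-time routine and the six-instruction chart are assumptions made for illustration. The saving of recognised structures in the chart is omitted.

```prolog
% Sketch only: a hand-"compiled" version of a hypothetical rule.
:- dynamic e/3.
:- op(700, xfx, :=).

% Chart: six instructions linked by their "addresses".
e(4, 5, [lxi, h, 4]).
e(5, 6, [dad, sp]).
e(6, 7, [push, h]).
e(7, 8, [lxi, h, 10]).
e(8, 9, [pop, d]).
e(9, 10, [call, pint]).

% address_of(N, P, Q): the instructions from P to Q compute the
% address of the variable at stack offset N.
address_of(N, P, Q) :-
    e(P, P1, [lxi, h, N]),
    e(P1, Q, [dad, sp]).

% stmt(S, P, Q): the instructions from P to Q implement statement S.
stmt(var(N) := K, P, Q) :-
    address_of(N, P, P1),
    e(P1, P2, [push, h]),
    e(P2, P3, [lxi, h, K]),
    e(P3, P4, [pop, d]),
    e(P4, Q, [call, pint]).
```

The query ?- stmt(S, P, Q). leaves the start address unbound, so the rule is tried at every instruction in the chart; here it succeeds only for the span from 4 to 10.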
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Figure  12  The  grammar  of  Figure  10  re-cast  for  use  with  the  chart-parsing  interpreter  of 

Figure 11.

e(1, 2, [push, b]).
e(2, 3, [push, b]).
e(3, 4, [push, h]).
e(4, 5, [lxi, h, 4]).
e(5, 6, [dad, sp]).
e(6, 7, [push, h]).
e(7, 8, [lxi, h, 10]).
e(8, 9, [pop, d]).
e(9, 10, [call, ?pint]).
e(10, 11, [lxi, h, 2]).
e(11, 12, [dad, sp]).
e(12, 13, [push, h]).
e(13, 14, [lxi, h, 1]).
e(14, 15, [pop, d]).
e(15, 16, [call, ?pint]).
e(16, 17, [lxi, h, 0]).
e(17, 18, [dad, sp]).
e(18, 19, [push, h]).
e(19, 20, [lxi, h, 1]).
e(20, 21, [pop, d]).
e(21, ?2, [call, ?pint]).
e(?2, 23, [lxi, h, 2]).
e(23, 24, [dad, sp]).
e(24, 25, [call, ?gint]).
e(25, 26, [push, h]).
e(26, 27, [lxi, h, 6]).
e(27, 28, [dad, sp]).
e(28, 29, [call, ?gint]).
e(29, 30, [pop, d]).
e(30, 31, [call, ?lt]).
e(31, 32, [mov, a, h]).
e(32, 33, [ora, l]).
e(33, 34, [jz, ?3]).
e(34, 35, [lxi, h, 2]).
e(35, 36, [dad, sp]).
e(36, 37, [push, h]).
e(37, 38, [lxi, h, 4]).
e(38, 39, [dad, sp]).
e(39, 40, [call, ?gint]).
e(40, 41, [push, h]).
e(41, 42, [lxi, h, 1]).
e(42, 43, [pop, d]).
e(43, 44, [dad, d]).
e(44, 45, [pop, d]).
e(45, 46, [call, ?pint]).
e(46, 47, [lxi, h, 0]).
e(47, 48, [dad, sp]).
e(48, 49, [push, h]).
e(49, 50, [lxi, h, 2]).
e(50, 51, [dad, sp]).
e(51, 52, [call, ?gint]).
e(52, 53, [push, h]).
e(53, 54, [lxi, h, 6]).
e(54, 55, [dad, sp]).
e(55, 56, [call, ?gint]).
e(56, 57, [pop, d]).
e(57, 58, [call, ?mul]).
e(58, 59, [pop, d]).
e(59, 60, [call, ?pint]).
e(60, ?3, [jmp, ?2]).
e(?3, 62, [pop, b]).
e(62, 63, [pop, b]).
e(63, 64, [pop, b]).
e(64, 65, [ret]).

Figure 13 8085 assembly instructions from the Small-C compiler formatted as Prolog
facts for application of the chart parser of Figure 11 and the grammar of Figure 12.

?- substr(1, _, pl_prog(S)), pprint(S).

e(4, 10, stmt(var4 := 10))
e(10, 16, stmt(var2 := 1))
e(16, ?2, stmt(var0 := 1))
e(34, 46, stmt(var2 := var2+1))
e(46, 60, stmt(var0 := var0*var2))
e(?2, ?3, stmt(while(var2<var4, (var2 := var2+1; var0 := var0*var2))))
number of automatic variables=3
program thing;
begin
    var4 := 10;
    var2 := 1;
    var0 := 1;
    while var2<var4 do begin
        var2 := var2+1;
        var0 := var0*var2
    end
end.

S = prog((var 4 := 10;
         (var 2 := 1;
         (var 0 := 1;
          while(var 2 < var 4,
                (var 2 := var 2 + 1;
                 var 0 := var 0 * var 2))))))

?-

Figure 14 Script of a Prolog session (user input in italics) showing the invocation of the
chart-parser. Recognised structures are printed as they are asserted in the database.


Interactive decompilation would be greatly assisted by the analyst being able to
indicate graphically the point in the assembly or machine code at which
decompilation should start and, perhaps, to select the program construct to be recognised
from a menu. Figure 15 shows a prototype interactive decompilation environment
written in Quintus MacProlog on an Apple Macintosh II computer. Apart from code
concerned with the user interface, the Prolog program is that of the chart-parser
described above. The window labelled "ASM-80" displays in conventional format the
8085 assembly language represented internally as in Figure 13. The window labelled
"Grammar" displays the grammar of Figure 12 for browsing and, when appropriate,
refinement and extension. The analyst has placed the cursor on the line labelled "4" to
indicate the start point for an attempted decompilation and has selected "stmt(_)" (see
Figure 12) as the syntactic category to be recognised. The successfully recognised PL
structure, in this case an assignment statement, is displayed in the window labelled "PL
Statement" and the assembly code which it spans is highlighted in the "ASM-80"
window. The "Summary" window displays the current results of decompilation.

Figure 15 Interactive decompiling environment. See text for details.

4.2 Application Domain Semantics - The Symbol Table


The design of a computer program is only partially captured in the structure of its source
code. In order to make it understandable, it must be related to the application domain.
This is typically provided by the choice of meaningful names for variables and by the
insertion of comments. The interactive environment proposed provides for this.
Variables and labels derived during the decompilation can be assigned meaningful
symbols by the analyst asserting facts of the form:

symbol(var(1), count).
symbol(1, start).

Comments  can  likewise  be  attached  to  segments  of  code  by  the  analyst  specifying  the 
"address"  range: 

comment(1, 2, 'initialise count').
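
Given facts of this form, the substitution of symbols when a statement is displayed might be sketched as below. The predicate rename/2, its helper rename_args/2 and the names product and count are assumptions made for illustration and do not describe the prototype's actual code. Only variable entries are included in this sketch, because a label entry such as symbol(1, start) would also match the integer constant 1 under this naive traversal.

```prolog
% Sketch only: rename/2 and the symbol names are hypothetical.
:- dynamic symbol/2.
:- op(700, xfx, :=).

symbol(var(0), product).
symbol(var(2), count).

% rename(+Term, -Renamed): copy Term, replacing every sub-term that
% has a symbol-table entry by its symbol.
rename(T, T) :- var(T), !.
rename(T, S) :- symbol(T, S), !.
rename(T, R) :-
    compound(T), !,
    T =.. [F|Args],
    rename_args(Args, RArgs),
    R =.. [F|RArgs].
rename(T, T).

rename_args([], []).
rename_args([A|As], [R|Rs]) :-
    rename(A, R),
    rename_args(As, Rs).
```

For example, rename(var(2) := var(2)+1, R) binds R to count := count+1, while a variable with no entry, such as var(4), is left unchanged.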

Many  programming  languages  allow  constants  to  be  represented  by  symbolic  names, 
with  the  actual  value  defined  at  one  place  in  the  code.  This  simplifies  modification  of 
the  program  as  well  as  making  it  more  easily  understood.  Replacement  of  the  symbol 
with  its  numerical  equivalent  is  a  trivial  operation  for  the  compiler.  The  reversal  of 
this  process  is  in  general  difficult  and  requires  understanding  of  the  meaning  of  the 
program  in  the  application  domain.  Unusual  (so-called  magic  numbers)  or  recurring 
values  might  be  brought  to  the  attention  of  the  analyst  for  possible  replacement  by 
symbolic  constants.  The  recognition  of  constants  with  prosaic  values  which  might  also 
arise  from  many  unrelated  causes  (for  example  the  values  0  or  1)  would  require  deep 
understanding  of  the  program's  domain  semantics  and  is  unlikely  to  be  achieved 
automatically.  A  mechanism  for  replacing  numerical  values  with  symbols  requires  a 
means of referring to the values to be replaced. The simplest approach is to edit the
assembly  or  machine  code  then  re-run  the  decompiler  over  the  modified  structures. 

The prototype interactive environment described above permits the user to assign
symbolic names to variables. These are stored in a symbol table and used when PL code
is displayed. Figure 16 shows the Macintosh "Dialog Box" for editing the symbol
table. Arbitrary names for variables in the "Summary" window (Figure 15) have been
replaced by (presumably) meaningful symbols. Currently a one-to-one correspondence
between variables and symbols is required.

Figure 16 Interactive decompiling environment showing the symbol-table
editing facility. See text for details.

The above arrangement can handle static variables and automatic variables (that is, those
for which storage is allocated on the stack) from a single context (procedure or block). In
general automatic variables must be labelled with their context, for example
var(main, 1). This requires that the grammar-rules concerned be augmented with variables
to record the context whenever a new stack-frame is established.


4.3  Recovering  Data  Types 

In addition to recovering the control structures, the decompilation process should
attempt to recover data type information. Simple (scalar) data types include those
directly supported by machine instructions (in the case of the 8085, 8-bit bytes, which
may be interpreted as characters or integers, and 16-bit words representing addresses
or signed or unsigned integers) as well as extended precision integers, floating point
numbers of varying precision, ranges of integers, enumerated types (such as days of the
week) and sets, typically represented as bit-maps. Complex data types include
character strings, records (or structures) and arrays of the aforementioned simple and
complex types. Modern languages, particularly those claiming to be object-oriented,
allow the programmer to define a rich hierarchy of types (usually called classes) to
represent concepts in the problem domain. We shall not consider such languages here;
rather, discussion will be confined to the data types provided for in languages such as
Pascal.

While the problem of recovering type definitions is somewhat orthogonal to the use of
DCGs, it is relevant to the question of the practicability of decompilation and we
suggest here a general approach which involves, for scalar types, the following
processes:


a. the recognition of the storage class of a variable (number of bytes occupied,
alignment) from the instructions used to access it.

b. assigning attributes to the type of the variable according to the operations
which are performed on it (integer or floating point arithmetic, comparisons,
bitwise logical operations).

c.  assigning  variables  to  the  same  class  where  they  are  used  in  operations 
together  and  where  attributes  already  assigned  are  compatible. 

d.  defining  a  class  of  variables  as  the  transitive  closure  of  the  same-class 
relation  defined  by  process  c. 

The  blackboard  model,  referred  to  in  section  4,  provides  a  suitable  framework  for  the 
application of such heuristics encoded as Prolog procedures. Attributes of variables can
be  asserted  into  the  database  as  they  are  recognised.  Examples  of  attribute  assertions 
include: 

size(var(1), 2).
size(var(2), 1).

participates_in(var(1), int_arith).
participates_in(var(2), byte_compare).
assigned_value(var(2), 1).

Recognised type compatibilities can be asserted thus:

same_type(var(1), var(3)).

These  can  be  summarised  as  type  declarations  whenever  a  decompiled  listing  is 
requested. 
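
Process d, the transitive closure of the same-class relation, might be sketched as a simple reachability search over same_type/2 facts. The predicates linked/2, type_class/2 and reach/3 and the sample facts are assumptions made for illustration. The compatibility test of process c is taken to have been applied before the facts were asserted.

```prolog
% Sketch only: predicate names and sample facts are hypothetical.
:- use_module(library(lists)).
:- dynamic same_type/2.

same_type(var(1), var(3)).
same_type(var(3), var(5)).
same_type(var(2), var(4)).

% same_type/2 is treated as symmetric.
linked(X, Y) :- same_type(X, Y).
linked(X, Y) :- same_type(Y, X).

% type_class(+V, -Class): Class is the sorted list of all variables
% connected to V through any chain of same_type facts.
type_class(V, Class) :-
    reach([V], [V], Seen),
    sort(Seen, Class).

% reach(+Queue, +SeenSoFar, -Seen): breadth-first search.
reach([], Seen, Seen).
reach([X|Queue], Seen0, Seen) :-
    findall(Y, (linked(X, Y), \+ memberchk(Y, Seen0)), Ys0),
    sort(Ys0, Ys),
    append(Seen0, Ys, Seen1),
    append(Queue, Ys, Queue1),
    reach(Queue1, Seen1, Seen).
```

With the sample facts, type_class(var(5), C) binds C to [var(1), var(3), var(5)]. One type declaration per distinct class could then be emitted when a listing is requested.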

In a strongly-typed language such as Pascal, integer subrange types might be recognised
from run-time bounds-checks while enumerated types might be inferred from the set of
constants assigned to variables of the type. Samples of values assigned to variables
obtained from the run-time environment or data files would be of considerable benefit in
determining the range and type of data, although relating external representations to
internal values would require a deep understanding of input/output procedures.

In the absence of knowledge of the application domain, it is not possible to
differentiate an integer subrange type [1 .. 7] from an enumerated type [sun, mon, tue,
wed, thu, fri, sat] which may have been used by the original programmer. Such
semantics may in some cases be provided as hints in an application domain knowledge
base (for example, in an application dealing with dates, enumerated types days of the
week and months of the year might be expected, as well as subrange types [1 .. 31] for
days of the month, etc.). However, as with the assignment of meaningful variable names,
significant interaction with the analyst will be required. Assertions of correspondence
between numeric and symbolic values for enumerated types might take the form:

symbolic_value(var(2), 1, monday).

The comments in the previous section regarding automatic variables from multiple
contexts apply equally to the recording of facts relating to type as they do to symbolic
names. In addition it is desirable to minimise the scope of static variables in
reconstructing declarations, even though program semantics may be the same with
variables having global scope.


5. CONSTRUCTING DECOMPILING GRAMMARS

The  grammars  for  decompiling  8085  assembly  code  presented  here  have  been  discovered  by 
examining  the  code  generated  by  the  Small-C  compiler  from  known  fragments  of  source  code. 
In  addition,  run-time  procedures  were  identified  with  the  help  of  mnemonic  labels.  In  fact 
these  latter  were  generally  quite  short  and  their  functions  easy  to  determine.  Automating  the 
semantic  analysis  of  such  simple  run-time  procedures  should  not  be  difficult. 

There  will  be  some  cases  in  practice  where  the  compiler  used  to  generate  the  program  of 
interest  will  be  known  or  can  be  guessed  at.  Software  for  embedded  microprocessors  is  often 
compiled  on  development  systems  using  languages  provided  by  the  chip  manufacturer  or  by  a 
major  software  vendor.  Application  programs  on  general  purpose  computers  may  be  written 
using the standard system programming tools (for example, C under UNIX).

Further experimentation will be needed to determine the difficulty of constructing a
decompiler when the source language and compiler are unknown and the extent to which this
process might be automated. An even more difficult question is whether it is practicable to
decompile to a high-level language code which was actually written in assembly language
or which has undergone intensive optimisation. It does seem likely that a language such as C,
with its many constructs directly reflecting machine operations, would be a more promising
target in this case than, say, Pascal.

6. DISCUSSION


This paper has described techniques for the construction of decompilers using definite clause
grammars compiled or interpreted as Prolog programs. The principles have been illustrated
using subsets of C compiled for the 8085 microprocessor. These experiments demonstrate that
the use of DCGs and Prolog is a viable approach. In particular, the use of the Prolog database
to store the initial assembly (or machine) code and recognised syntactic structures in the
manner of a chart-parser supports an interactive approach and the application of additional
knowledge. While the work described has not been carried through to the completion of a
decompiler for a complete real-world programming language, Btushe (1990) describes a Prolog
decompiler for a significant subset of PL/M-80.

Avenues  for  further  research  include  studies  of  different  combinations  of  languages,  compilers 
and  target  hardware,  methods  for  handling  compiler  optimisation,  the  possibility  of 
recognising  high-level  constructs  such  as  loops  and  if-then-else  statements  in  hand-written 
assembly  code,  and  the  use  of  heuristic  knowledge  of  both  the  programming  process  and  the 
application  domain.  Efficient  implementation  may  be  an  issue  for  larger  problems. 

Finally  we  note  that  other  approaches  to  the  rapid  construction  of  reverse  engineering  tools 
apart  from  the  use  of  Prolog  are  possible.  Other  symbol  manipulation  languages  such  as  Lisp 
could  be  chosen  as  a  starting  point,  but  generally  would  require  significantly  more  work  by  the 
system  builder.  Kotik  and  Markosian  (1989)  describe  the  application  of  the  REFINE 
programming  language  and  environment  to  software  re-engineering  problems.  REFINE 
provides  tools  for  the  construction  of  parsers  and  pretty-printers  from  context-free  grammars, 
and a rule-based programming style for semantic processing of the resulting abstract syntax
tree. However, its purchase cost is many times that of the Prolog systems used in the current
work. It demands a powerful workstation with a copious supply of memory and is less
portable than Prolog.